For the project of the course Data Preparation and Workflow Management at Tilburg University, we analyzed the Airbnb market in Amsterdam, and in particular whether the COVID-19 pandemic influenced the required minimum nights of stay. The literature on this subject reports contradictory findings. A recent article in the New York Times suggested that the minimum nights of stay increased in New York City during the COVID-19 pandemic, whereas research by Kourtit et al. concluded that minimum-night requirements actually decreased during the pandemic. To examine this contradiction, we collected Airbnb data for Amsterdam from 2020 and from 2022 and tested whether the minimum nights of stay differ significantly between the period during the pandemic and the period after it.
We ran a linear regression on the variables of interest. The dependent variable, the required minimum nights, is a metric variable, and the independent variable, the presence of COVID-19 (present vs. absent), is a non-metric variable. We have data from 2020 and 2022 for 3960 different Airbnb listings (7920 observations in total). The covid variable takes the value 1 (TRUE) if the observation is from 2020, when COVID-19 was present in the Netherlands, and the value 0 (FALSE) if the observation is from 2022, when the pandemic no longer had far-reaching consequences in the Netherlands. Besides the minimum nights of stay and the presence of COVID-19, we added control variables to the analysis to see whether other effects play a role. Because these control variables are a mix of metric and non-metric variables, we chose linear regression over ANOVA.
In addition to the dependent variable minimum_nights and the independent variable covid, we included several control variables in a first regression. The control variables neighbourhood_num and roomtype_num were converted to factors, in which each number represents a different neighbourhood or room type. The variables accommodates, price and instant_bookable were also included in this regression.
Call:
lm(formula = minimum_nights ~ covid + as.factor(neighbourhood_num) +
as.factor(roomtype_num) + accommodates + price + instant_bookable,
data = df_cleaned)
Residuals:
Min 1Q Median 3Q Max
-4.125 -1.208 -0.465 0.420 56.063
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.8135725 0.4647861 3.902 9.62e-05 ***
covidTRUE -0.3726339 0.0650950 -5.724 1.08e-08 ***
as.factor(neighbourhood_num)2 -0.1691364 0.4187420 -0.404 0.686286
as.factor(neighbourhood_num)3 0.5684251 0.4020637 1.414 0.157469
as.factor(neighbourhood_num)4 1.8470679 0.5182352 3.564 0.000367 ***
as.factor(neighbourhood_num)5 0.1645501 0.4031250 0.408 0.683148
as.factor(neighbourhood_num)6 0.2245008 0.4908163 0.457 0.647394
as.factor(neighbourhood_num)7 0.0232328 0.4525262 0.051 0.959056
as.factor(neighbourhood_num)8 0.3722813 0.4200383 0.886 0.375481
as.factor(neighbourhood_num)9 0.2608471 0.4156185 0.628 0.530276
as.factor(neighbourhood_num)10 0.0958234 0.4532241 0.211 0.832560
as.factor(neighbourhood_num)11 0.0529102 0.4089200 0.129 0.897052
as.factor(neighbourhood_num)12 0.3786304 0.8039026 0.471 0.637661
as.factor(neighbourhood_num)13 0.2277393 0.5198490 0.438 0.661335
as.factor(neighbourhood_num)14 0.1334606 0.3999017 0.334 0.738589
as.factor(neighbourhood_num)15 0.1186491 0.3992311 0.297 0.766326
as.factor(neighbourhood_num)16 1.4707543 0.5527789 2.661 0.007815 **
as.factor(neighbourhood_num)17 0.4515336 0.4314885 1.046 0.295383
as.factor(neighbourhood_num)18 0.0214983 0.4353634 0.049 0.960618
as.factor(neighbourhood_num)19 -0.7940233 0.5832449 -1.361 0.173430
as.factor(neighbourhood_num)20 0.6924107 0.4121391 1.680 0.092989 .
as.factor(neighbourhood_num)21 0.4857487 0.4348607 1.117 0.264019
as.factor(neighbourhood_num)22 0.0988388 0.4109316 0.241 0.809931
as.factor(roomtype_num)2 1.9593214 0.2433889 8.050 9.48e-16 ***
as.factor(roomtype_num)3 0.6605957 0.2421359 2.728 0.006382 **
as.factor(roomtype_num)4 -0.0888187 0.4806452 -0.185 0.853398
accommodates -0.0129177 0.0260161 -0.497 0.619535
price -0.0011770 0.0003356 -3.507 0.000456 ***
instant_bookableTRUE -0.4890019 0.0768799 -6.361 2.12e-10 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.81 on 7891 degrees of freedom
Multiple R-squared: 0.07123, Adjusted R-squared: 0.06793
F-statistic: 21.61 on 28 and 7891 DF, p-value: < 2.2e-16
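The output of the first regression shows that many of the estimates are not significant in this model: of the 21 neighbourhood dummies only two (neighbourhoods 4 and 16) are significant at the 5% level, and accommodates is not significant either. The coefficient of covid is negative (-0.373) and significant (p < 0.001), as are the coefficients of price and instant_bookable. The model explains about 7% of the variance in minimum nights (R-squared = 0.071).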
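The model specification can be sketched on synthetic data. The data frame toy below is a made-up stand-in for df_cleaned (it is not the Airbnb data, and the data-generating process is an assumption chosen only for illustration); the formula is the one shown in the Call above.

```r
# Synthetic stand-in for df_cleaned (NOT the Airbnb data): 200 hypothetical
# listings, each observed in 2020 (covid = TRUE) and in 2022 (covid = FALSE)
set.seed(1)
n_listings <- 200
toy <- data.frame(
  covid             = rep(c(TRUE, FALSE), each = n_listings),
  neighbourhood_num = rep(sample(1:5, n_listings, replace = TRUE), times = 2),
  roomtype_num      = rep(sample(1:3, n_listings, replace = TRUE), times = 2),
  accommodates      = rep(sample(1:6, n_listings, replace = TRUE), times = 2),
  price             = round(runif(2 * n_listings, 50, 400)),
  instant_bookable  = sample(c(TRUE, FALSE), 2 * n_listings, replace = TRUE)
)
# Made-up outcome, chosen only so that the example contains some signal
toy$minimum_nights <- pmax(1, round(3 - 0.4 * toy$covid + rnorm(nrow(toy), sd = 2)))

# as.factor() turns the numeric codes into dummy variables, with the first
# level as the reference category
m_toy <- lm(minimum_nights ~ covid + as.factor(neighbourhood_num) +
              as.factor(roomtype_num) + accommodates + price + instant_bookable,
            data = toy)

# The row "covidTRUE" holds the estimated 2020-versus-2022 difference
covid_row <- coef(summary(m_toy))["covidTRUE", ]
covid_row
```

Each factor contributes one coefficient per level except the reference level, which is why the real model with 22 neighbourhoods and 4 room types reports 21 and 3 dummy coefficients.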
Before any conclusions can be drawn, we need to perform some robustness checks.
A first option to check for independence is to create a scatterplot of the residuals against the fitted values of the linear regression model (in this case m1). The residuals should show no systematic pattern against the fitted values, but the scatterplot shows that this is not the case. We therefore conclude that the residuals are not independent.
Next to that, we can create a scatterplot of the predicted values against the actual values.
# Create a scatterplot of predicted vs actual values
library(ggplot2)
# Fitted values of m1; this line assumes the column was not already created earlier
df_cleaned$predicted <- predict(m1)
ggplot(df_cleaned, aes(x = predicted, y = minimum_nights)) +
  geom_point() + # adds points to the plot
  geom_abline(intercept = 0, slope = 1, color = "red") + # adds a diagonal line to visualize where predicted = actual
  xlab("Predicted Values") + # adds a label for the x-axis
  ylab("Actual Values") + # adds a label for the y-axis
  ggtitle("Predicted vs Actual Values Plot") # adds a title to the plot
Durbin-Watson test
data: m1
DW = 0.1109, p-value < 2.2e-16
alternative hypothesis: true autocorrelation is greater than 0
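A last thing we can do to check for independence is to perform a Durbin-Watson test. The statistic DW = 0.1109 is far below 2, the value expected when there is no first-order autocorrelation, and the p-value is below 2.2e-16. We therefore reject the null hypothesis of no autocorrelation in favour of positive autocorrelation of the residuals.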
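To make the Durbin-Watson statistic concrete, the sketch below computes it by hand in base R on synthetic data (not the Airbnb data). The helper dw_stat is a hypothetical name introduced here; the printed test output above has the format produced by dwtest() from the lmtest package, although the report does not show the call.

```r
# Durbin-Watson statistic by hand: DW = sum((e_t - e_{t-1})^2) / sum(e_t^2)
dw_stat <- function(model) {
  e <- resid(model)            # residuals in the row order of the data
  sum(diff(e)^2) / sum(e^2)
}

# Synthetic illustration: AR(1) errors with strong positive autocorrelation
# versus independent errors
set.seed(42)
n <- 500
x <- rnorm(n)
e_ar  <- as.numeric(arima.sim(list(ar = 0.9), n = n))
y_ar  <- 1 + 2 * x + e_ar
y_iid <- 1 + 2 * x + rnorm(n)

dw_ar  <- dw_stat(lm(y_ar ~ x))    # well below 2: positive autocorrelation
dw_iid <- dw_stat(lm(y_iid ~ x))   # close to 2: no first-order autocorrelation
c(autocorrelated = dw_ar, independent = dw_iid)
```

Because the statistic compares each residual with the one in the previous row, it depends on the row order of the data frame.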
The same plots that were used to check independence can also be used to check for homoskedasticity.
studentized Breusch-Pagan test
data: m1
BP = 93.6, df = 28, p-value = 5.38e-09
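A last thing we can do to check for homoskedasticity is to perform a Breusch-Pagan test. With BP = 93.6 on 28 degrees of freedom and a p-value of 5.38e-09, the null hypothesis of constant residual variance is rejected, so the residuals are heteroskedastic.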
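The studentized Breusch-Pagan statistic can be reproduced by hand, which shows what the test measures: how well the model's regressors explain the squared residuals. The sketch below uses base R and synthetic data (not the Airbnb data); bp_stat is a hypothetical helper name, and the printed output above has the format produced by bptest() from the lmtest package, although the report does not show the call.

```r
# Studentized (Koenker) Breusch-Pagan statistic by hand:
# regress the squared residuals on the model's regressors, then BP = n * R^2
bp_stat <- function(model) {
  e2  <- resid(model)^2
  X   <- model.matrix(model)               # includes the intercept column
  aux <- lm.fit(X, e2)                     # auxiliary regression
  r2  <- 1 - sum(aux$residuals^2) / sum((e2 - mean(e2))^2)
  df  <- aux$rank - 1                      # regressors excluding the intercept
  bp  <- length(e2) * r2
  c(BP = bp, df = df, p.value = pchisq(bp, df, lower.tail = FALSE))
}

# Synthetic illustration: the error standard deviation grows with x
set.seed(10)
n <- 500
x <- runif(n)
y_het <- 1 + x + rnorm(n, sd = 0.5 + 2 * x)

bp_het <- bp_stat(lm(y_het ~ x))
bp_het
```

The degrees of freedom equal the number of regressors without the intercept, which matches df = 28 for the 28 slope coefficients of m1.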
## Making a dataframe with the residuals
residuals <- resid(m1)
residuals_df <- data.frame(residuals = residuals)
# Test for normality of residuals with a histogram
ggplot(residuals_df, aes(x = residuals)) +
  geom_histogram(binwidth = 0.5, color = "black", fill = "white") +
  xlab("Residuals") + ylab("Frequency") +
  ggtitle("Histogram of Residuals")
The histogram is used to judge whether the residuals are normally distributed. The residual summary of m1 (median -0.465, maximum 56.063) already indicates a long right tail, which is not what normally distributed residuals would produce.
# Create random subsample of 5000 observations, so we are able to run a Shapiro-Wilk normality test (5000 is the maximum sample size)
set.seed(123)
my_subsample <- residuals_df[sample(nrow(residuals_df), 5000), ]
shapiro.test(my_subsample)
Shapiro-Wilk normality test
data: my_subsample
W = 0.48635, p-value < 2.2e-16
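A last thing we can do to check for normality is to perform a Shapiro-Wilk normality test on a random subsample of 5000 residuals. With W = 0.48635 and a p-value below 2.2e-16, the null hypothesis of normally distributed residuals is rejected.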
We can use the same first plot as used for testing independence and homoskedasticity. We can conclude…
GVIF Df GVIF^(1/(2*Df))
covid 1.062619 1 1.030834
as.factor(neighbourhood_num) 1.263242 21 1.005579
as.factor(roomtype_num) 1.417333 3 1.059852
accommodates 1.508334 1 1.228142
price 1.623853 1 1.274305
instant_bookable 1.189423 1 1.090607
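One thing we can do to check for multicollinearity is to calculate variance inflation factors (VIFs). Because the model contains factors with more than one degree of freedom, generalized VIFs are reported, and the comparable quantity is GVIF^(1/(2*Df)). All values lie between 1.01 and 1.27, which gives no indication of problematic multicollinearity.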
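For predictors with a single degree of freedom, the VIF can be computed by hand as 1 / (1 - R^2), where R^2 comes from regressing that predictor on all the others. The sketch below illustrates this in base R on synthetic data (not the Airbnb data); vif_by_hand is a hypothetical helper name and does not cover the generalized VIF that is needed for multi-level factors.

```r
# VIF for predictor j: 1 / (1 - R^2_j), with R^2_j from regressing
# predictor j on all the other predictors
vif_by_hand <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    r2 <- summary(lm(reformulate(others, response = p), data = data))$r.squared
    1 / (1 - r2)
  })
}

# Synthetic illustration: price is moderately correlated with accommodates,
# covid is unrelated to both
set.seed(7)
n <- 1000
toy <- data.frame(accommodates = sample(1:8, n, replace = TRUE))
toy$price <- 40 * toy$accommodates + rnorm(n, sd = 160)
toy$covid <- rbinom(n, 1, 0.5)

vifs <- vif_by_hand(toy, c("accommodates", "price", "covid"))
round(vifs, 3)
```

A VIF of 1 means the predictor is not linearly related to the other predictors; the value rises as the other predictors explain more of it.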
# correlation matrix
cor(df_cleaned[c("covid", "neighbourhood_num", "roomtype_num", "accommodates", "price", "instant_bookable")])
                         covid neighbourhood_num roomtype_num accommodates
covid              1.000000000      -0.006718571  -0.01926055   0.01126654
neighbourhood_num -0.006718571       1.000000000  -0.06783638   0.01764622
roomtype_num      -0.019260549      -0.067836379   1.00000000  -0.21649496
accommodates       0.011266540       0.017646217  -0.21649496   1.00000000
price             -0.164739514       0.007828189  -0.28513626   0.51640989
instant_bookable   0.112195055      -0.061621148   0.24110291  -0.06298997
                         price instant_bookable
covid             -0.164739514       0.11219506
neighbourhood_num  0.007828189      -0.06162115
roomtype_num      -0.285136264       0.24110291
accommodates       0.516409886      -0.06298997
price              1.000000000      -0.11301022
instant_bookable  -0.113010224       1.00000000
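Another thing we can do to check for multicollinearity is to make a correlation matrix of the predictors. The largest correlation is 0.516, between accommodates and price; all other correlations are below 0.3 in absolute value. Note that neighbourhood_num and roomtype_num are numeric codes for nominal categories, so their correlations should be interpreted with caution.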
# eigenvalues and condition number
eigen(cor(df_cleaned[c("covid", "neighbourhood_num", "roomtype_num", "accommodates", "price", "instant_bookable")]))$values
[1] 1.7869855 1.0974403 1.0378347 0.9427096 0.6897544 0.4452754
kappa(model.matrix(m1))
[1] 51687.31
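A last thing we can do to check for multicollinearity is to calculate the eigenvalues of the correlation matrix and the condition number of the model matrix. The eigenvalues range from 0.445 to 1.787 and none is close to zero, which is consistent with the low VIFs. The condition number of 51687.31 is large, but it is computed on the unscaled model matrix, so it reflects differences in the scale of the predictors as well as any collinearity.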
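The condition number returned by kappa() is sensitive to the scale of the columns, which matters when a model matrix mixes dummies with a variable such as price. The sketch below illustrates this in base R with two synthetic, independent predictors (not the Airbnb data); all variable names are hypothetical.

```r
# Two independent predictors on very different scales
set.seed(3)
n  <- 500
x1 <- rnorm(n)               # unit scale, comparable to a dummy or a small count
x2 <- rnorm(n, sd = 1000)    # large scale, generated independently of x1

X_raw    <- cbind(1, x1, x2)                  # intercept plus raw columns
X_scaled <- cbind(1, scale(x1), scale(x2))    # intercept plus standardized columns

k_raw    <- kappa(X_raw, exact = TRUE)      # large, although x1 and x2 are unrelated
k_scaled <- kappa(X_scaled, exact = TRUE)   # close to 1
c(raw = k_raw, scaled = k_scaled)
```

A large condition number on unscaled columns is therefore not by itself evidence of collinearity; standardizing the columns first gives a value that is easier to interpret.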
covid 19 on minimum nights of stay - team 1